Abstract



The choice of the PageRank damping factor is not evident. Google's choice of
the value c = 0.85 was a compromise between the true reflection of the Web structure and
numerical efficiency. However, the Markov random walk on the original Web Graph does not
reflect the importance of the pages because it is absorbed in dead ends. Thus, the damping factor
is needed not only for speeding up the computations but also for establishing a fair ranking
of pages. In this paper, we propose new criteria for choosing the damping factor, based on
the ergodic structure of the Web Graph and probability flows. Specifically, we require that
the core component receives a fair share of the PageRank mass. Using a singular perturbation
approach, we conclude that the value c = 0.85 is too high and suggest that the damping factor
should be chosen around 1/2. As a by-product, we describe the ergodic structure of the OUT
component of the Web Graph in detail. Our analytical results are confirmed by experiments
on two large samples of the Web Graph.

Keywords: PageRank, Web Graph, Singular Perturbation Theory 

1 Introduction 

Surfers on the Internet frequently use search engines to find pages satisfying their query. However, 
there are typically hundreds or thousands of relevant pages available on the Web. Thus, listing 
them in a proper order is a crucial and non-trivial task. One can use several criteria to sort 
relevant answers. It turns out that the link-based criteria provide rankings that appear to be 
very satisfactory to Internet users. Examples of link-based criteria are PageRank [18], used by
the search engine Google, HITS [T^], used by the search engines Teoma and Ask, and SALSA [TO].

In the present work we restrict ourselves to the analysis of the PageRank criterion and use the
following definition of PageRank from [15]. Denote by $n$ the total number of pages on the Web
and define the $n \times n$ hyperlink matrix $P$ as follows:

$$ p_{ij} = \begin{cases} 1/d_i, & \text{if page } i \text{ links to page } j, \\ 1/n, & \text{if page } i \text{ is dangling}, \\ 0, & \text{otherwise}, \end{cases} \tag{1} $$

for $i, j = 1, \dots, n$, where $d_i$ is the number of outgoing links from page $i$. We recall that a
page is called dangling if it does not have outgoing links. In order to make the hyperlink graph
connected, it is assumed that at each step, with some probability, a random surfer goes to an
arbitrary Web page sampled from the uniform distribution. Thus, the PageRank is defined as a
stationary distribution of a Markov chain whose state space is the set of all Web pages, and the
transition matrix is

$$ G = cP + (1 - c)(1/n)E, \tag{2} $$
where $E$ is a matrix whose entries are all equal to one, and $c \in (0, 1)$ is the probability of following
a hyperlink. The constant $c$ is often referred to as a damping factor. The Google matrix $G$ is
stochastic, aperiodic, and irreducible, so there exists a unique row vector $\pi$ such that

$$ \pi G = \pi, \qquad \pi \mathbf{1} = 1, \tag{3} $$

where $\mathbf{1}$ is a column vector of ones. The row vector $\pi$ satisfying (3) is called a PageRank vector,
or simply PageRank. If we consider a surfer that follows a hyperlink with probability $c$ and jumps
to a random page with probability $1 - c$, then $\pi_i$ can be interpreted as a stationary probability
that the surfer is at page $i$.
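The definition of PageRank as the stationary vector of $G$ can be turned into a few lines of power iteration. The following is only a minimal sketch, not the program used for the experiments in this paper: the function name `pagerank`, its stopping parameters, and the four-page graph are assumptions made for illustration.

```python
def pagerank(out_links, c=0.85, tol=1e-12, max_iter=10_000):
    """Power iteration for pi G = pi with G = c P + (1 - c)(1/n) E.

    out_links[i] lists the pages that page i links to; a page with an
    empty list is dangling and is treated as linking to every page.
    """
    n = len(out_links)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Mass on dangling pages is spread uniformly (rows of P equal to 1/n).
        dangling_mass = sum(pi[i] for i in range(n) if not out_links[i])
        new = [(1.0 - c) / n + c * dangling_mass / n] * n
        for i, targets in enumerate(out_links):
            if targets:
                share = c * pi[i] / len(targets)
                for j in targets:
                    new[j] += share
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical 4-page graph: 0 -> 1, 2; 1 -> 2; 2 -> 0; page 3 is dangling.
links = [[1, 2], [2], [0], []]
pi = pagerank(links, c=0.85)
uniform = pagerank(links, c=0.0)
```

With c = 0 the surfer never follows links, so the sketch returns the uniform vector; as c grows the link structure gains weight and the iteration needs more steps to converge.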

The damping factor c is a crucial parameter in the PageRank definition. It regulates the level
of the uniform noise introduced to the system. Based on publicly available information, Google
originally used c = 0.85. The empirical explanation of this choice is as follows (see e.g. [15]):
it seems that the closer the value of the damping factor is to one, the better the graph structure of
the Web is represented in the PageRank vector. However, when the value of c approaches one,
the convergence of the power iteration method slows down significantly. The choice c = 0.85 appears to be
a reasonable compromise between these two conflicting objectives. However, in [7] the authors
argue that choosing the value of c too close to one is not necessarily a good thing to do. Not only
does the power iteration method converge very slowly, but the ranking of the important
pages also becomes distorted. Independently of [7], this phenomenon was also mentioned in [2]. We also
remark that another incentive to reduce c is that it increases the robustness of PageRank
to small changes in the link structure. That is, with smaller c, one can bound the influence
of outgoing links of a page (or a small group of pages) on the PageRank of other groups [6] and
on its own PageRank [2].

In the present work, we go further than [2, 7] and suggest that even the value c = 0.85 is by far
too large. Our argument is that one has to choose c to reflect the natural intensity of the
probability flow in the absorbing Markov chain associated with the Web Graph. Our argument
is based on singular perturbation theory [1, 13, 19, 21]. It turns out that the value of c that
adequately reflects the flow of probability is very close to 1/2. We note that the value c = 1/2 was
used in [9] to find gems in scientific citations, where the authors justified this choice by an intuitive
argument discussed in more detail in Section 6. In this work, we present mathematical evidence
for setting c = 1/2 in the PageRank formula.

Of course, a drastic reduction of c considerably accelerates the computation of PageRank by
numerical methods [3/5, 15]. We would like to mention that choosing a smaller value for the damping
factor could have a similar effect on numerical methods as choosing a fast-decreasing damping function.

As a by-product of the application of the singular perturbation approach we obtain a refinement
of the graph structure of the Web. We demonstrate that the dead-end strongly connected
components have unjustifiably large PageRank with damping factor c = 0.85 and that taking c = 0.5
mitigates this problem. The results presented in this work are confirmed by experimental
data that we obtained from two large samples of the Web Graph, described in Section 2.

The main contributions of this paper are as follows. First, in Section 3 we describe the
ergodic structure of the Web Graph and show how this structure changes under the assumption that
the dangling pages have a link to all pages in the Web, as in (1). In particular, we discover an
Extended Strongly Connected Component (ESCC) that contains a majority of the Web pages.
Using the theory of singular perturbations, we find an exact formula for the limiting PageRank
distribution when $c \to 1$. This result immediately implies that the limiting PageRank mass of
ESCC equals zero. Next, in Section 4 we analytically characterize the PageRank mass of ESCC
as a function of c, and we obtain simple bounds for this function. Further, in Section 5 we argue
that c = 1/2 ensures that ESCC receives a fair share of total PageRank mass. We conclude with
a short discussion of the present results and future research directions in Section 6.
2 Datasets



For our numerical experiments, we have collected two Web Graphs, which we denote by INRIA
and FrMathInfo. The Web Graph INRIA was taken from the site of INRIA, the French Research
Institute of Informatics and Automatics. The seed for the INRIA collection was the Web page
www.inria.fr. It is a typical large Web site with around 300,000 pages and 2 million hyperlinks.
We crawled the INRIA site until we had collected all pages belonging to INRIA.

The Web Graph FrMathInfo was crawled with the initial seeds of 50 mathematics and informatics
laboratories of France, taken from Google Directory. The crawl was executed by breadth-first
search and the depth of this crawl was 6. The FrMathInfo Web Graph contains around
700,000 pages and 8 million hyperlinks. We expect our datasets to be sufficiently representative.
This is justified by the fractal structure of the Web [10].

The link structure of these two Web Graphs is stored in an Oracle database. Due to the sparsity of
the Web Graph and the reasonable sizes of our datasets, we can store the adjacency lists in RAM to
speed up the computation of PageRank and other quantities of interest. This enables us to perform
more iterations, which is extremely important when the damping factor c is close to
one. Our PageRank computation program takes about one hour to perform 500 iterations for
the FrMathInfo dataset and about half an hour for the INRIA dataset for the same number of
iterations. Our algorithms for discovering the ergodic structures of the Web Graph are based on
Breadth First Search and Depth First Search methods, which are linear in the sum of the number of
nodes and links.

3 Ergodic structure of the Web Graph 

In [Sim] the authors have studied the graph structure of the Web. In particular, in [SHU] it was
shown that the Web Graph can be divided into three principal components: the Giant Strongly
Connected Component, to which we simply refer as the SCC component, the IN component and the
OUT component. The SCC component is the largest strongly connected component in the Web
Graph. In fact, it is larger than the second largest strongly connected component by several orders
of magnitude. Following hyperlinks one can come from the IN component to the SCC component
but it is not possible to return. Then, from the SCC component one can come to the OUT
component and it is not possible to return to SCC from the OUT component.

With this structure in mind, we would like to analyze ergodic properties of the random walk
on the Web Graph. If a node has outgoing links, then such a random walk follows one of these links
with uniform distribution. However, as in the definition of PageRank, we have to define how the
process evolves when it reaches one of the dangling nodes. This choice has a crucial influence on the
ergodic structure of the associated Markov chain. There are three natural possibilities: 1) the
process is absorbed in the dangling node; 2) the process moves to the predecessor node, or 3) the
process moves to an arbitrary node. In this paper we focus on the last option, which is used in
the original PageRank model [18]. The first two options are also worth considering
and are a good topic for future research.

Thus, throughout the paper we consider a random walk with transition matrix $P$ given by
(1). As we shall see below, the analysis of the ergodic structure of $P$ leads to a more detailed
description of the OUT component, and it allows us to evaluate the effect of the damping factor on
PageRank. Obviously, the graph induced by $P$ has a much higher connectivity than the original
Web Graph. In particular, if the random walk can move from a dangling node to an arbitrary
node with the uniform distribution, then the Giant SCC component increases further in size. We
refer to this new strongly connected component as the Extended Strongly Connected Component
(ESCC). First, we note that due to the artificial links from the dangling nodes, the SCC component
and IN component are now inter-connected and are parts of the Extended SCC. Then, if there are
dangling nodes reachable from some nodes in the OUT component, these nodes together with all
their predecessors become a part of the Extended SCC. Let us consider the example graph
presented in Figure 1. Node 0 represents the IN component, nodes 1 to 3 form the SCC
component, and the rest of the nodes, nodes 4 to 11, are in the OUT component. Node 5 is a
dangling node; thus, artificial links go from the dangling node 5 to all other nodes. After addition
of the artificial links, all nodes from 0 to 5 form a new strongly connected component, which is
the ESCC in this example.
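The construction of the ESCC suggests a simple test: once dangling nodes link to every node, a node belongs to the ESCC exactly when it can reach some dangling node. The sketch below implements this with a breadth first search on the reversed graph. It assumes that at least one dangling node exists; the function name `split_escc` and the edge list are hypothetical (the edges are only one graph consistent with the description of Figure 1, not the figure itself).

```python
from collections import deque


def split_escc(out_links):
    """Return (escc, pure_out) as sorted node lists.

    Assumes at least one dangling node. A node is in the ESCC iff it can
    reach a dangling node, since dangling nodes jump to every node.
    """
    n = len(out_links)
    reverse = [[] for _ in range(n)]
    for i, targets in enumerate(out_links):
        for j in targets:
            reverse[j].append(i)
    dangling = [i for i in range(n) if not out_links[i]]
    seen = set(dangling)
    queue = deque(dangling)
    while queue:                      # BFS backwards from the dangling nodes
        j = queue.popleft()
        for i in reverse[j]:
            if i not in seen:
                seen.add(i)
                queue.append(i)
    return sorted(seen), [i for i in range(n) if i not in seen]


# Hypothetical edges: node 0 is IN, 1-3 form the SCC, 4-11 are OUT, 5 is dangling.
example = [
    [1], [2], [3], [1, 4],        # 0 -> 1, cycle 1 -> 2 -> 3 -> 1, 3 -> 4
    [5, 6], [],                   # 4 -> 5 (dangling) and 4 -> 6
    [7, 8], [10],                 # transient nodes of the Pure OUT part
    [9], [8], [11], [10],         # two closed 2-cycles: {8, 9} and {10, 11}
]
escc, pure_out = split_escc(example)
```

The search touches every link once, so the cost is linear in the number of nodes and links.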




Figure 1: Example of a graph 



By renumbering the nodes, the transition matrix $P$ can then be transformed to the following
form

$$ P = \begin{bmatrix} Q & 0 \\ R & T \end{bmatrix}, \tag{4} $$



where the block $T$ corresponds to the Extended SCC, the block $Q$ corresponds to the part of the
OUT component without dangling nodes and their predecessors, and the block $R$ corresponds to
the transitions from ESCC to the nodes in block $Q$. We refer to the set of nodes in the block $Q$ as
the Pure OUT component. In the example graph in Figure 1, the Pure OUT component consists
of nodes 6 to 11. Typically, the Pure OUT component is much smaller than the Extended
SCC component. The sizes of all components for our two datasets are displayed in Table 1. We
would like to note that the zero size of the IN components should not come as a surprise. To crawl
the Web Graph we have used the Breadth First Search method and have started from important
pages. Therefore, it is natural that the seed pages belong to the Giant SCC and there is no IN
component. For the purposes of the present research the absence of the IN component is not a
problem, as the dangling nodes unite the IN and the Giant SCC into the Extended SCC.
As was observed in [17], the PageRank vector can be expressed by the following formula

$$ \pi = \frac{1 - c}{n}\, \mathbf{1}^T [I - cP]^{-1}. \tag{5} $$



If we substitute the expression (4) for the transition matrix $P$ into (5), we obtain the following
formula for the part of the PageRank vector corresponding to the nodes in ESCC:

$$ \pi_T = \frac{1 - c}{n}\, \mathbf{1}^T [I - cT]^{-1}, $$

or, equivalently,

$$ \pi_T = (1 - c)\,\alpha\, u_T [I - cT]^{-1}, \tag{6} $$

where $\alpha = n_T/n$, $n_T$ is the number of nodes in ESCC, and $u_T$ is the uniform distribution
over all ESCC nodes. We shall also use $n_Q = n - n_T$, which is the number of nodes in the Pure
OUT component.
First, we note that since the matrix $T$ is substochastic, the inverse $[I - T]^{-1}$ exists and consequently
$\pi_T \to 0$ as $c \to 1$. Clearly, as was also observed in [7], it is not good to take the value of c too
close to one. Next, we argue that even the value of 0.85 is too large.

Let us analyze the structure of the Pure OUT component in more detail. It turns out that there 
are many disjoint strongly connected components inside the Pure OUT component. One can see 
the histograms of the SCC sizes of the Pure OUT component for our two datasets, INRIA and FrMathInfo, in
Figures 2 and 3. In particular, there are many SCCs of size 2 and 3 in the Pure OUT component.
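SCC sizes of the kind shown in the histograms can be obtained with any linear-time SCC algorithm. The sketch below uses an iterative two-pass (Kosaraju-style) depth first search restricted to a given node set. The paper states only that BFS and DFS based methods were used, so this particular algorithm, the name `scc_sizes`, and the small graph (hypothetical edges consistent with the description of Figure 1) are assumptions.

```python
from collections import Counter


def scc_sizes(out_links, nodes):
    """Sizes of the strongly connected components of the subgraph induced by `nodes`."""
    keep = set(nodes)
    adj = {u: [v for v in out_links[u] if v in keep] for u in keep}
    radj = {u: [] for u in keep}
    for u, targets in adj.items():
        for v in targets:
            radj[v].append(u)

    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, visited = [], set()
    for start in keep:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:                       # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    sizes, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        stack, size = [root], 0
        while stack:
            node = stack.pop()
            size += 1
            for prev in radj[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes.append(size)
    return sizes


# Hypothetical graph; nodes 6..11 play the role of the Pure OUT component.
graph = [[1], [2], [3], [1, 4], [5, 6], [], [7, 8], [10], [9], [8], [11], [10]]
sizes = scc_sizes(graph, range(6, 12))
histogram = Counter(sizes)              # SCC size -> number of SCCs of that size
```

In this toy graph the two 2-cycles are closed classes of the kind discussed in the text, while the singleton components are transient.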
Figure 2: Histogram of SCCs' sizes of Pure OUT, INRIA dataset

Figure 3: Histogram of SCCs' sizes of Pure OUT, FrMathInfo dataset

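By appropriate renumbering of the states, we can refine (4) as follows:

$$ P = \begin{bmatrix} Q_1 & & 0 & 0 & 0 \\ & \ddots & & & \\ 0 & & Q_m & 0 & 0 \\ S_1 & \cdots & S_m & S_0 & 0 \\ R_1 & \cdots & R_m & R_0 & T \end{bmatrix}. \tag{7} $$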



For instance, in the example graph from Figure 1, nodes 8 and 9 correspond to block
$Q_1$, nodes 10 and 11 correspond to block $Q_2$, and nodes 6 and 7 correspond to the blocks $S$.

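Since the random walk will eventually be absorbed in one of the $Q$ blocks, we can simplify
the notation for our further analysis. Namely, define the submatrices

$$ \bar{R}_i = \begin{bmatrix} S_i \\ R_i \end{bmatrix}, \quad i = 1, \dots, m; \qquad \bar{T} = \begin{bmatrix} S_0 & 0 \\ R_0 & T \end{bmatrix}. $$

Then the structure (7) becomes

$$ P = \begin{bmatrix} Q_1 & & 0 & 0 \\ & \ddots & & \\ 0 & & Q_m & 0 \\ \bar{R}_1 & \cdots & \bar{R}_m & \bar{T} \end{bmatrix}. \tag{8} $$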



Next, we note that if $c < 1$, then the Markov chain induced by the matrix $G$ is ergodic. However, if
$c = 1$, the Markov chain becomes non-ergodic. In particular, if the process moves to one of the $Q_i$
blocks, it will never leave this block. Hence, the random walk governed by the Google transition
matrix (2) is in fact a singularly perturbed Markov chain.

According to singular perturbation theory (see e.g. [1, 13, 19, 21]), the PageRank vector
goes to some limit as the damping factor goes to one. Using the results of singular perturbation
theory we can characterize this limit explicitly.

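Proposition 1 Let $\mu_i$ be a limiting stationary distribution of the Markov process governed by $P$
when the process settles in $Q_i$, the $i$-th SCC of the Pure OUT component. Namely, the vector $\mu_i$ is the
unique solution of the equations

$$ \mu_i Q_i = \mu_i, \qquad \mu_i \mathbf{1} = 1. $$

Then, we have

$$ \lim_{c \to 1} \pi(c) = [\bar{\pi}_1 \ \cdots \ \bar{\pi}_m \ 0], \qquad \bar{\pi}_i = \left( \frac{n_i}{n} + \frac{n_{\bar{T}}}{n}\, u_{\bar{T}} [I - \bar{T}]^{-1} \bar{R}_i \mathbf{1} \right) \mu_i, \tag{9} $$

for $i = 1, \dots, m$, where $n_i$ is the number of nodes in block $Q_i$, $n_{\bar{T}}$ is the number of nodes in
block $\bar{T}$, and $u_{\bar{T}}$ is the uniform distribution over the nodes of $\bar{T}$. The zeros at the end of the
limiting vector correspond to all nodes that are not in $Q_i$, $i = 1, \dots, m$, that is, not in any SCC of
the Pure OUT component.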



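Proof: First, we note that if we make the change of variables $\varepsilon = 1 - c$, the Google matrix becomes
a transition matrix of a singularly perturbed Markov chain as in Lemma 1 with $C = \frac{1}{n}\mathbf{1}\mathbf{1}^T - P$.
Let us calculate the aggregated generator matrix $D$:

$$ D = MCQ = \frac{1}{n} M \mathbf{1}\mathbf{1}^T Q - MPQ. $$

Using $MP = M$, $MQ = I$, and $M\mathbf{1} = \mathbf{1}$, where the vectors $\mathbf{1}$ are of appropriate dimensions, we obtain

$$ D = \frac{1}{n}\mathbf{1}\mathbf{1}^T Q - I = \frac{1}{n}\mathbf{1}\left[\, n_1 + n_{\bar{T}} u_{\bar{T}} [I - \bar{T}]^{-1} \bar{R}_1 \mathbf{1},\ \dots,\ n_m + n_{\bar{T}} u_{\bar{T}} [I - \bar{T}]^{-1} \bar{R}_m \mathbf{1} \,\right] - I. $$

Since the aggregated transition matrix $D + I$ has identical rows, its stationary distribution $\nu$ is just
equal to these rows. Thus, invoking Lemma 1 we obtain (9).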

The second term inside the brackets in formula (9) corresponds to the PageRank mass that
an SCC component in Pure OUT receives from the Extended SCC. If c is close to one, then
this contribution can by far outweigh the fair share of the PageRank, which is given by $n_i/n$. For
instance, in our numerical experiments with c = 0.85, the PageRank mass of the Pure OUT
component in the INRIA dataset equals $1.95\,n_Q/n$, whereas a 'fair share' is $n_Q/n$. In the other
dataset, FrMathInfo, the unfairness is even more pronounced: the PageRank mass of the Pure
OUT component is $3.44\,n_Q/n$. This gives users an incentive to create 'dead-ends': groups of pages
that link only to each other. In the next sections we quantify the influence of the parameter c and
show that in order to obtain a balanced probability flow, c should be taken around 1/2.
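Proposition 1 can be checked numerically on a toy chain. The sketch below is an illustration under stated assumptions, not the authors' code: the six-node chain, the function names, and the iteration counts are hypothetical. Nodes 0 and 1 form the ESCC (node 0 is dangling), {2, 3} and {4, 5} are closed 2-cycles, and there are no transient Pure OUT nodes, so $\bar{T} = T$ and $\bar{R}_i = R_i$. The nodes are numbered with the ESCC first, which is only a relabeling of (8).

```python
def absorption_weights(T, R_blocks, sizes, n, sweeps=500):
    """Weights n_i/n + (n_T/n) u_T [I - T]^{-1} R_i 1 from formula (9).

    [I - T]^{-1} R_i 1 is the vector of absorption probabilities into Q_i;
    it is obtained by iterating a <- T a + R_i 1 instead of inverting I - T.
    """
    n_t = len(T)
    weights = []
    for R, n_i in zip(R_blocks, sizes):
        b = [sum(row) for row in R]
        a = [0.0] * n_t
        for _ in range(sweeps):
            a = [b[s] + sum(T[s][t] * a[t] for t in range(n_t)) for s in range(n_t)]
        weights.append(n_i / n + sum(a) / n)    # (n_T/n) * u_T a equals sum(a)/n
    return weights


def pagerank_dense(P, c, iters):
    """Plain power iteration pi <- (1 - c)/n + c * pi P for a dense matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [(1.0 - c) / n + c * sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical chain: node 0 dangling, 1 -> {0, 2}, 2 <-> 3, 4 <-> 5.
P = [
    [1 / 6] * 6,
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]
T = [row[:2] for row in P[:2]]
R1 = [row[2:4] for row in P[:2]]
R2 = [row[4:6] for row in P[:2]]

weights = absorption_weights(T, [R1, R2], sizes=[2, 2], n=6)
# mu_i is uniform on each 2-cycle, so each node gets half of its block's weight.
limit = [0.0, 0.0] + [weights[0] / 2] * 2 + [weights[1] / 2] * 2
pi = pagerank_dense(P, c=0.999, iters=20000)
gap = max(abs(a - b) for a, b in zip(pi, limit))
```

For this chain the two weights work out to 5/9 and 4/9 although both closed classes have the same size: the class {2, 3} is favored only because it receives an extra link from the ESCC, which is the effect discussed in the paragraph above.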



                              INRIA     FrMathInfo
Total size                    318585    764119
Number of nodes in SCC        154142    333175
Number of nodes in IN         0         0
Number of nodes in OUT        164443    430944
Number of nodes in ESCC       300682    760016
Number of nodes in Pure OUT   17903     4103
Number of SCCs in OUT         1148      1382
Number of SCCs in Pure OUT    631       379

Table 1: Component sizes in INRIA and FrMathInfo datasets

4 PageRank mass of ESCC 

Let us consider the PageRank mass of the Extended SCC component (ESCC) described in the
previous section. Thus, we continue to analyze the transition matrix in the form presented in (4).
Our goal now is to characterize the behavior of the total PageRank mass of the ESCC component
as a function of $c \in [0, 1]$. From (6) we have

$$ \|\pi_T(c)\|_1 = \pi_T(c)\mathbf{1} = (1 - c)\,\alpha\, u_T [I - cT]^{-1} \mathbf{1} = (1 - c)\,\alpha\, u_T \sum_{k=0}^{\infty} c^k T^k \mathbf{1}. \tag{10} $$



Clearly, since $T$ is substochastic, we have $\|\pi_T(0)\|_1 = \alpha$ and $\|\pi_T(1)\|_1 = 0$. Also, it is easy to
show that

$$ \frac{d}{dc}\|\pi_T(c)\|_1 = -\alpha\, u_T [I - cT]^{-2} [I - T] \mathbf{1} < 0 $$

and

$$ \frac{d^2}{dc^2}\|\pi_T(c)\|_1 = -2\alpha\, u_T [I - cT]^{-3}\, T\, [I - T] \mathbf{1} < 0. $$

Hence, $\|\pi_T(c)\|_1$ is a concave decreasing function.
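The two derivatives stated above follow from a short computation starting from (10). The sketch below fills in that step; the shorthand $M = [I - cT]^{-1}$ is introduced here only for readability and is not part of the original notation.

```latex
% Shorthand (introduced here): M = M(c) = [I - cT]^{-1}.
% Since T and M commute, dM/dc = T M^2 and d(M^2)/dc = 2 T M^3.
\begin{align*}
\frac{d}{dc}\,\|\pi_T(c)\|_1
  &= \alpha\, u_T \bigl( -M + (1-c)\, T M^2 \bigr) \mathbf{1} \\
  &= \alpha\, u_T M^2 \bigl( -(I - cT) + (1-c)\, T \bigr) \mathbf{1} \\
  &= -\alpha\, u_T [I - cT]^{-2} [I - T]\, \mathbf{1}, \\[4pt]
\frac{d^2}{dc^2}\,\|\pi_T(c)\|_1
  &= -\alpha\, u_T \cdot 2\, T M^3\, [I - T]\, \mathbf{1}
   = -2\alpha\, u_T [I - cT]^{-3}\, T\, [I - T]\, \mathbf{1}.
\end{align*}
```

Both signs come from entrywise nonnegativity: $[I - T]\mathbf{1}$ is nonnegative with at least one positive entry because $T$ is substochastic, and $[I - cT]^{-1} = \sum_k c^k T^k$ dominates $I$ entrywise. Strictness of the second inequality additionally needs some ESCC state to reach an exit state in one or more steps, which is the case for a strongly connected block.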

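In order to get a better idea about the behavior of this function, we derive a series of bounds.
If we define

$$ \underline{p} = \inf_{k \ge 1} [u_T T^k \mathbf{1}]^{1/k}, \qquad \bar{p} = \sup_{k \ge 1} [u_T T^k \mathbf{1}]^{1/k}, $$

then it follows immediately from (10) that

$$ \frac{\alpha(1 - c)}{1 - c\,\underline{p}} \le \|\pi_T(c)\|_1 \le \frac{\alpha(1 - c)}{1 - c\,\bar{p}}. \tag{11} $$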



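Now, let $\lambda_1$ be the Perron-Frobenius eigenvalue of $T$, and let $\tau$ be the random time when a random
walk induced by $T$ leaves ESCC, given that the initial distribution is uniform on ESCC. It is well
known that

$$ \lambda_1 = \lim_{k \to \infty} P[\tau > k \mid \tau > k - 1] = \lim_{k \to \infty} \frac{u_T T^k \mathbf{1}}{u_T T^{k-1} \mathbf{1}}. $$

Thus, we evaluate $\lambda_1$ iteratively by computing

$$ \lambda_1^{(k)} = \frac{u_T T^k \mathbf{1}}{u_T T^{k-1} \mathbf{1}}, \qquad k \ge 1, \tag{12} $$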



where the numerator and denominator are simply results of the power iterations of $T$. From
the definition of $\lambda_1^{(k)}$ it is easy to see that if the sequence $\lambda_1^{(k)}$, $k \ge 1$, is increasing then the
sequence $(u_T T^k \mathbf{1})^{1/k}$, $k \ge 1$, is also increasing, and thus in this case $\bar{p} = \lambda_1$ and $\underline{p} = p_1$, where
$p_1 = u_T T \mathbf{1} = P(\tau > 1)$. Then equation (11) becomes

$$ \frac{\alpha(1 - c)}{1 - c\,p_1} \le \|\pi_T(c)\|_1 \le \frac{\alpha(1 - c)}{1 - c\,\lambda_1}. \tag{13} $$

Although in our experiments we indeed observed that the sequence $\lambda_1^{(k)}$, $k \ge 1$, is increasing for
both the INRIA and FrMathInfo datasets, this condition is still too strong and we presume that it may
fail in some cases. In the next proposition we provide much milder and more intuitive conditions
under which (13) still holds.

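Proposition 2 Let $\lambda_1$ be the Perron-Frobenius eigenvalue of $T$, and define $p_1 = u_T T \mathbf{1}$.
(i) If $p_1 < \lambda_1$ then

$$ \|\pi_T(c)\|_1 < \frac{\alpha(1 - c)}{1 - c\,\lambda_1}, \qquad c \in (0, 1). \tag{14} $$

(ii) If $1/(1 - p_1) < u_T [I - T]^{-1} \mathbf{1}$ then

$$ \|\pi_T(c)\|_1 > \frac{\alpha(1 - c)}{1 - c\,p_1}, \qquad c \in (0, 1). \tag{15} $$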



Proof. (i) The function $f(c) = \alpha(1 - c)/(1 - \lambda_1 c)$ is decreasing and concave, and so is
$\|\pi_T(c)\|_1$. Also, $\|\pi_T(0)\|_1 = f(0) = \alpha$, and $\|\pi_T(1)\|_1 = f(1) = 0$. Thus, for $c \in (0, 1)$, the plot
of $\|\pi_T(c)\|_1$ is either entirely above or entirely below $f(c)$. In particular, if the first derivatives at
$c = 0$ satisfy $\frac{d}{dc}\|\pi_T(c)\|_1 \big|_{c=0} < f'(0)$, then $\|\pi_T(c)\|_1 < f(c)$ for any $c \in (0, 1)$. Since
$f'(0) = \alpha(\lambda_1 - 1)$ and $\frac{d}{dc}\|\pi_T(c)\|_1 \big|_{c=0} = \alpha(p_1 - 1)$, we see that $p_1 < \lambda_1$ implies (14).

The proof of (ii) is similar. We consider a concave decreasing function $g(c) = \alpha(1 - c)/(1 - p_1 c)$
and note that $g(0) = \alpha$, $g(1) = 0$. Now, if the condition in (ii) holds then
$g'(1) > \frac{d}{dc}\|\pi_T(c)\|_1 \big|_{c=1}$, which implies (15).

Note that the conditions of Proposition 2 are satisfied when the sequence $\lambda_1^{(k)}$, $k \ge 1$, is
increasing in $k$.

The condition $p_1 < \lambda_1$, which gives the upper bound, has a clear intuitive interpretation. Let $\tilde{\pi}_T$
be a quasi-stationary distribution of $T$. By definition, $\tilde{\pi}_T$ is the probability-normed left Perron-Frobenius
eigenvector of $T$, and it is well known that $\tilde{\pi}_T$ is a limiting probability distribution
obtained under the condition that the random walk does not leave the ESCC component (see e.g.
[20]). Hence, $\tilde{\pi}_T T = \lambda_1 \tilde{\pi}_T$, and the condition $p_1 < \lambda_1$ means that the chance to stay in ESCC for
one step in the quasi-stationary regime is higher than starting from the uniform distribution $u_T$.
This inequality looks quite natural, since the quasi-stationary distribution should somehow favor
states from which the chance to leave ESCC is lower. Therefore, although $p_1 < \lambda_1$ does not hold
in general, one may expect that it should hold for transition matrices describing large entangled
graphs.

With the help of the derived bounds we can conclude that the function $\|\pi_T(c)\|_1$ decreases
very slowly for small and moderate values of c, and it decreases extremely fast when c becomes
close to 1. This typical behavior is clearly seen in Figures 4 and 5, where $\|\pi_T(c)\|_1$ is plotted with a
solid line. In order to evaluate $\|\pi_T(c)\|_1$ we did not compute it separately for different values of
c but rather presented it as a function of c so that any value of c could be substituted. For that,
we stored the values $\lambda_1^{(k)}$, $k \ge 1$, and then used (10) and (12) to obtain

$$ \|\pi_T(c)\|_1 = (1 - c)\,\alpha \sum_{k=0}^{\infty} c^k \prod_{l=1}^{k} \lambda_1^{(l)}, \qquad c \in [0, 1]. $$



This is a more direct approach compared to [7], where the authors used derivatives of the PageRank
to present $\pi$ as a function of c. As for the bounds, the values $\lambda_1$ and $p_1$ can be directly substituted
in (14) and (15), respectively. For the INRIA dataset we have $p_1 = \lambda_1^{(1)} = 0.97557$, $\lambda_1 = 0.99954$,
and for the FrMathInfo dataset we have $p_1 = 0.99659$, $\lambda_1 = 0.99937$.

Figure 4: PageRank mass of ESCC and bounds, INRIA dataset

Figure 5: PageRank mass of ESCC and bounds, FrMathInfo dataset
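The procedure of this section (store the ratios from (12), then evaluate the mass for any c) can be sketched as follows. This is an illustration only: the 2-by-2 substochastic matrix, the value of the ESCC fraction, the function names, and the truncation at 400 terms are assumptions, chosen so that the ratio sequence is increasing and the bounds (13) apply.

```python
def lambda_sequence(T, k_max):
    """lambda_1^(k) = u_T T^k 1 / u_T T^(k-1) 1 for k = 1..k_max, as in (12)."""
    n_t = len(T)
    row = [1.0 / n_t] * n_t               # u_T T^k, starting with k = 0
    prev_mass, lams = 1.0, []
    for _ in range(k_max):
        row = [sum(row[s] * T[s][t] for s in range(n_t)) for t in range(n_t)]
        mass = sum(row)
        lams.append(mass / prev_mass)
        prev_mass = mass
    return lams


def mass_from_lambdas(lams, alpha, c):
    """(1 - c) alpha sum_k c^k prod_{l<=k} lambda_1^(l), truncated at len(lams) terms."""
    total, term = 1.0, 1.0                # the k = 0 term equals 1
    for lam in lams:
        term *= c * lam
        total += term
    return (1.0 - c) * alpha * total


# Hypothetical symmetric substochastic block (row sums 0.99 and 0.95).
T = [[0.94, 0.05],
     [0.05, 0.90]]
alpha = 0.9                               # hypothetical fraction of nodes in ESCC
lams = lambda_sequence(T, 400)
p1, lam1 = lams[0], lams[-1]              # p_1 and an estimate of lambda_1

for c in (0.5, 0.85, 0.99):
    mass = mass_from_lambdas(lams, alpha, c)
    lower = alpha * (1 - c) / (1 - c * p1)
    upper = alpha * (1 - c) / (1 - c * lam1)
    print(f"c={c}: {lower:.4f} <= {mass:.4f} <= {upper:.4f}")
```

Because the stored ratios do not depend on c, the mass for a new value of c costs one pass over the list rather than a new power iteration.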

In the next section we use the above results on $\|\pi_T(c)\|_1$ and its bounds to determine the
values of c that reflect natural probability flows through the ESCC component.

5 Why the damping factor should be 1/2 

Since ESCC is a far more important and interesting part of the Web than the Pure OUT component,
it would be reasonable to ensure that the PageRank mass of ESCC is at least the fraction
of nodes in this component (we denoted this fraction by $\alpha$). However, because $\|\pi_T(c)\|_1$ is
decreasing and $\|\pi_T(0)\|_1 = \alpha$, it follows that the total PageRank mass of ESCC is smaller than $\alpha$
for any value $c > 0$.

Now let us discuss an 'optimal' choice of c. First of all, c cannot be too close to one because in
this case the PageRank mass of the giant ESCC component will be close to 0. This was observed
independently in [2, 7]. Specifically, from the analysis above it follows that the value of c should
not be chosen in the critical region where the PageRank mass of the ESCC component is rapidly
decreasing. Luckily, the shape of the function $\|\pi_T(c)\|_1$ is such that it decreases drastically only
when c is really close to one, which leaves a lot of freedom for choosing c. In particular, the famous
Google constant c = 0.85 is small enough to ensure a reasonably large PageRank mass of ESCC.

However, as we have observed in Section 3, even moderately large values of c result in an
unfairly large PageRank mass of the Pure OUT component. Now, our goal is to find the values of
c that lead to a 'fair' distribution of the PageRank mass between the Pure OUT and the ESCC
components.

Formally, we would like to define a number $\gamma \in (0, 1)$ such that a desirable PageRank mass of
ESCC could be written as $\gamma\alpha$, and then find the value $c^*$ that satisfies

$$ \|\pi_T(c^*)\|_1 = \gamma\alpha. $$

Then $c \le c^*$ will ensure that $\|\pi_T(c)\|_1 \ge \gamma\alpha$. Naturally, $\gamma$ should somehow reflect the properties
of the substochastic block $T$. For instance, as $T$ becomes closer to a stochastic matrix, $\gamma$ should
also increase. One possibility is to define

$$ \gamma = v T \mathbf{1}, $$

where $v$ is a row vector representing some probability distribution on ESCC. Then the damping
factor c should satisfy

$$ c \le c^*, $$

where $c^*$ is given by

$$ \|\pi_T(c^*)\|_1 = \alpha\, v T \mathbf{1}. \tag{16} $$

In this setting, $\gamma$ is the probability to stay in ESCC for one step if the initial distribution is $v$. For given
$v$, this number increases as $T$ becomes closer to a stochastic matrix. Now, the problem of choosing
$\gamma$ comes down to the problem of choosing $v$. The advantage of this approach is twofold. First,
we still have all the flexibility because, depending on $v$, the value of $\gamma$ may be literally anything
between zero and one. Second, we can use a probabilistic interpretation of $v$ to make a reasonable
choice. In this paper we consider three appealing choices of $v$:

1. $\tilde{\pi}_T$, the quasi-stationary distribution of $T$,

2. the uniform vector $u_T$, and

3. the normalized PageRank vector $\pi_T(c)/\|\pi_T(c)\|_1$. Note that in this case both $v$ and $\gamma$ depend
on c.

First, let us take $v = \tilde{\pi}_T$, the quasi-stationary distribution of $T$. The motivation for taking
$v = \tilde{\pi}_T$ is that $\tilde{\pi}_T$ weights the states according to their quasi-stationary probabilities, which
captures the structure of $T$. With $v = \tilde{\pi}_T$, equation (16) becomes

$$ \pi_T(c^*)\mathbf{1} = \alpha\, \tilde{\pi}_T T \mathbf{1} = \alpha\lambda_1. $$

In this case, $\gamma = \lambda_1$ is the probability that the random walk stays in ESCC given that it did
not leave this block for an infinitely long time. Hence, $\lambda_1$ is a natural measure of the proximity of $T$ to
a stochastic matrix.

If the conditions of Proposition 2 are satisfied, then (14) and (15) hold, and thus the value of $c^*$
satisfying (16) must be in the interval $(c_1, c_2)$, where

$$ (1 - c_1)/(1 - p_1 c_1) = \lambda_1, \qquad (1 - c_2)/(1 - \lambda_1 c_2) = \lambda_1. $$

It is easy to check that $c_1 = (1 - \lambda_1)/(1 - \lambda_1 p_1)$ and $c_2 = 1/(\lambda_1 + 1)$. Since $\lambda_1$ is very close to
1, it follows that c is bounded from above by the number $c^* < c_2$, where $c_2$ is only slightly larger
than 1/2. Numerical results for our two datasets are presented in Table 2.

From the numerical results we can see that in the case $v = \tilde{\pi}_T$, we obtain $c^*$ close to zero.
This, however, leads to a ranking that takes into account only local information about the Web Graph.
Specifically, the number of incoming links will play a dominant role in the PageRank value (see
e.g. [11]). Furthermore, the interpretation of $v = \tilde{\pi}_T$ also suggests that it is not the best choice
because the 'easily bored surfer' random walk that is used in PageRank computations never follows
a quasi-stationary distribution. Indeed, with probability $1 - c$, this random walk restarts itself
from the uniform probability vector. Clearly, the intervals between subsequent restarting points
are too short to reach a quasi-stationary regime.

v                        c                  INRIA     FrMathInfo
\tilde{\pi}_T            c_1                0.0184    0.1571
                         c_2                0.5001    0.5002
                         c^*                0.02      0.16
u_T                      c_3                0.5062    0.5009
                         c_4                0.9820    0.8051
                         c^*                0.604     0.535
\pi_T/\|\pi_T\|_1        1/(1+\lambda_1)    0.5001    0.5002
                         1/(1+p_1)          0.5062    0.5009

Table 2: Values of c* with bounds.

Our second choice is the uniform vector $v = u_T$. In this case, (16) becomes

$$ \|\pi_T(c^*)\|_1 = \alpha\, u_T T \mathbf{1} = \alpha p_1. $$

If the conditions of Proposition 2 hold then we again can use (14) and (15) to establish that
$c^* \in (c_3, c_4)$, where

$$ (1 - c_3)/(1 - p_1 c_3) = p_1, \qquad (1 - c_4)/(1 - \lambda_1 c_4) = p_1. $$

The values of $c_3 = 1/(1 + p_1)$, $c_4 = (1 - p_1)/(1 - \lambda_1 p_1)$, and $c^*$ for our datasets are given in Table 2.
As we see, in this case, we have obtained a higher upper bound. However, the values of $c^*$ are still
much smaller than 0.85.
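The closed-form bounds for the first two choices of $v$ are easy to evaluate. The sketch below plugs in the rounded INRIA values of $\lambda_1$ and $p_1$ reported in Section 4; the function name `bounds_for_c` is an assumption, and because the inputs are rounded the results agree with the INRIA column of Table 2 only to about three decimal places.

```python
def bounds_for_c(lam1, p1):
    """Closed-form solutions of the four bound equations of this section."""
    c1 = (1.0 - lam1) / (1.0 - lam1 * p1)   # (1 - c1)/(1 - p1*c1)   = lam1
    c2 = 1.0 / (1.0 + lam1)                 # (1 - c2)/(1 - lam1*c2) = lam1
    c3 = 1.0 / (1.0 + p1)                   # (1 - c3)/(1 - p1*c3)   = p1
    c4 = (1.0 - p1) / (1.0 - lam1 * p1)     # (1 - c4)/(1 - lam1*c4) = p1
    return c1, c2, c3, c4


# Rounded INRIA values from Section 4.
lam1, p1 = 0.99954, 0.97557
c1, c2, c3, c4 = bounds_for_c(lam1, p1)
print(f"c1={c1:.4f}  c2={c2:.4f}  c3={c3:.4f}  c4={c4:.4f}")
```

The intervals $(c_1, c_2)$ and $(c_3, c_4)$ share the denominator $1 - \lambda_1 p_1$, which is why $c_1$ and $c_4$ react strongly to small changes in $p_1$ while $c_2$ and $c_3$ stay close to 1/2.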

Note that $v = \tilde{\pi}_T$ implies $\gamma = \lambda_1$, which is the probability to stay in ESCC for one step after an
infinitely long time, and $v = u_T$ leads to $\gamma = p_1$, which is the probability to stay in ESCC for one
step after starting afresh. Our third choice, the normalized PageRank vector $v = \pi_T/\|\pi_T\|_1$, is
a symbiosis of the previous two cases. With this choice of $v$, according to (16), the value $c = c^*$
solves the equation

$$ \|\pi_T(c)\|_1 = \frac{\alpha}{\|\pi_T(c)\|_1}\, \pi_T(c) T \mathbf{1} = \frac{\alpha^2 (1 - c)}{\|\pi_T(c)\|_1}\, u_T [I - cT]^{-1} T \mathbf{1}, $$

where the last equality follows from (6). Multiplying by $\|\pi_T(c)\|_1$, we obtain

$$ \begin{aligned} \|\pi_T(c)\|_1^2 &= \frac{\alpha^2 (1 - c)}{c}\, u_T\, cT [I - cT]^{-1} \mathbf{1} \\ &= \frac{\alpha^2 (1 - c)}{c}\, u_T \left( [I - cT]^{-1} - I \right) \mathbf{1} \\ &= \frac{\alpha}{c}\, \|\pi_T(c)\|_1 - \frac{(1 - c)\,\alpha^2}{c}. \end{aligned} $$

Solving the quadratic equation for $\|\pi_T(c)\|_1$, we get

$$ \|\pi_T(c)\|_1 = r(c) = \begin{cases} \alpha, & \text{if } c \le 1/2, \\ \dfrac{\alpha(1 - c)}{c}, & \text{if } c > 1/2. \end{cases} $$

Hence, the value $c^*$ solving (16) corresponds to the point where the graphs of $\|\pi_T(c)\|_1$ and $r(c)$
cross each other. First, note that there is only one such point on (0, 1). Furthermore, since
$\|\pi_T(c)\|_1$ decreases very slowly unless c is close to one, and $r(c)$ starts decreasing relatively fast
for $c > 1/2$, one can expect that $c^*$ is only slightly larger than 1/2. This is illustrated in Figure 6,
where we depict $\|\pi_T(c)\|_1$ and $r(c)$ for the INRIA and FrMathInfo datasets.

Figure 6: The value $c^*$ with $v = \pi_T/\|\pi_T\|_1$ is the crossing point of $r(c)$ and $\|\pi_T(c)\|_1$.

Under the conditions of Proposition 2, we may use (14) and (15) to deduce that $r(c)$ first crosses
the line $\alpha(1 - c)/(1 - \lambda_1 c)$, then $\|\pi_T(c)\|_1$, and then $\alpha(1 - c)/(1 - p_1 c)$. Thus, we obtain

$$ \frac{1}{1 + \lambda_1} < c^* < \frac{1}{1 + p_1}. $$

Since both $\lambda_1$ and $p_1$ are close to 1, this clearly indicates that c should be chosen close to 1/2.
The values of the lower and upper bounds for $c^*$ are given in Table 2. Since these bounds are tight
we did not compute $c^*$ explicitly.
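The crossing points behind these bounds can be reproduced numerically. The sketch below takes the admissible root of the quadratic as $r(c)$ and locates by bisection where it meets a bound curve of the form $\alpha(1 - c)/(1 - \lambda c)$. The function names, the bracketing interval, and the number of bisection steps are assumptions; the fraction of ESCC nodes cancels out of the crossing condition, so it defaults to 1.

```python
def r(c, alpha=1.0):
    """Admissible root of x^2 - (alpha/c) x + (1 - c) alpha^2 / c = 0.

    The discriminant is proportional to 1 - 4c(1 - c) = (1 - 2c)^2, so the
    roots are alpha and alpha (1 - c)/c; the mass cannot exceed alpha,
    hence the smaller root is taken.
    """
    d = abs(1.0 - 2.0 * c)
    return min(alpha * (1.0 - d) / (2.0 * c), alpha * (1.0 + d) / (2.0 * c))


def crossing(lam, alpha=1.0, lo=0.5, hi=0.999999, steps=100):
    """Bisection for the c in (1/2, 1) where r(c) meets alpha (1 - c)/(1 - lam c)."""
    def gap(c):
        return r(c, alpha) - alpha * (1.0 - c) / (1.0 - lam * c)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:      # r is still above the bound curve
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Rounded INRIA values from Section 4.
lower_bound = crossing(0.99954)     # crossing with the curve built from lambda_1
upper_bound = crossing(0.97557)     # crossing with the curve built from p_1
```

The two crossings coincide with $1/(1 + \lambda_1)$ and $1/(1 + p_1)$, which is how the interval for $c^*$ above is obtained without computing $\|\pi_T(c)\|_1$ itself.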

To summarize, our results indicate that with c = 0.85, the ESCC component does not receive a 
fair share of the PageRank mass. Remarkably, in order to satisfy any of the three intuitive criteria 
of fairness presented above, the value of c should be drastically reduced. In particular, the value 
c = 1/2 looks like a well justified choice. 

In the future, it would be interesting to design and analyze other criteria for choosing the most
'fair' value of c. However, given the outcome of our studies, we foresee that any criterion based 
on the PageRank mass of ESCC will lead to similar results. 




6 Conclusions 

The choice of the PageRank damping factor is not evident. The old motivation for the value 
c = 0.85 was a compromise between the true reflection of the Web structure and numerical 
efficiency. However, the Markov random walk on the Web Graph does not reflect the importance 
of the pages because it is absorbed in dead ends. Thus, the damping factor is needed not only for
speeding up the computations but also for establishing a fair ranking of pages. 

In this paper, we proposed new criteria for choosing the damping factor, based on the ergodic 
structure and probability flows. Our approach leads to the conclusion that the value c = 0.85 is 
too high, and in fact the damping factor should be chosen close to 1/2. 

As we already mentioned before, the value c = 1/2 was used in [9] to find gems in scientific 
citations. This choice was justified intuitively by stating that researchers may check references in 
cited papers but on average they hardly go deeper than two levels, which results in probability 
1/2 of 'giving up'. Nowadays, when search engines work really fast, this argument also applies to 
Web search. Indeed, it is easier for the user to refine a query and receive a proper page in a fraction
of a second than to look for this page by clicking on hyperlinks. Therefore, we may assume that a
surfer searching for a page, on average, does not go deeper than two clicks. 
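
The link between "two levels" and the value 1/2 can be made explicit with a line of arithmetic. The sketch below is a minimal illustration under an assumption of ours: the surfer continues from each page with probability c, independently, so the number of pages seen before giving up is geometric with mean 1/(1 - c). The function name is hypothetical.

```python
def expected_pages_visited(c, terms=500):
    """Mean number of pages seen before giving up, for damping factor c.

    Assumes P(exactly k pages) = c**(k - 1) * (1 - c); the truncated series
    is summed numerically and should agree with 1 / (1 - c).
    """
    return sum(k * c ** (k - 1) * (1.0 - c) for k in range(1, terms + 1))


for c in (0.5, 0.85):
    print(c, expected_pages_visited(c), 1.0 / (1.0 - c))
```

Under this reading, c = 1/2 corresponds to two pages on average, whereas c = 0.85 corresponds to between six and seven.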

Even if our statement that c should be 1/2 might be received with healthy skepticism,
we hope to have convinced the reader that the study of ergodic structure of the Web helps in 
choosing the value of the damping factor, and in improving link-based ranking criteria in general. 
We believe that future research in this direction will yield new reasoning for a well grounded choice 
of the ranking criteria and help to discover new fascinating properties of the Web Graph. 

Appendix: A Singular Perturbation Lemma



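Lemma 1 Let $G(\varepsilon) = P + \varepsilon C$ be a transition matrix of a perturbed Markov chain.
The perturbed Markov chain is assumed to be ergodic for sufficiently small $\varepsilon$ different from
zero. Let the unperturbed Markov chain ($\varepsilon = 0$) have m ergodic classes. Namely, the transition
matrix P can be written in the form

$$
P = \begin{bmatrix}
Q_1 & & 0 & 0 \\
 & \ddots & & \vdots \\
0 & & Q_m & 0 \\
R_1 & \cdots & R_m & T
\end{bmatrix} \in \mathbb{R}^{n \times n}.
$$

Then, the stationary distribution of the perturbed Markov chain has a limit

$$
\lim_{\varepsilon \to 0} \pi(\varepsilon) = [\nu_1 \mu_1 \ \cdots \ \nu_m \mu_m \ \ 0],
$$

where zeros correspond to the set of transient states in the unperturbed Markov chain, $\mu_i$ is a
stationary distribution of the unperturbed Markov chain corresponding to the i-th ergodic set, and
$\nu_i$ is the i-th element of the aggregated stationary distribution vector that can be found by solving

$$
\nu D = \nu, \qquad \nu \mathbf{1} = 1,
$$

where D = MCQ is the generator of the aggregated Markov chain and

$$
M = \begin{bmatrix}
\mu_1 & & 0 & 0 \\
 & \ddots & & \vdots \\
0 & & \mu_m & 0
\end{bmatrix} \in \mathbb{R}^{m \times n},
\qquad
Q = \begin{bmatrix}
\mathbf{1} & & 0 \\
 & \ddots & \\
0 & & \mathbf{1} \\
\phi_1 & \cdots & \phi_m
\end{bmatrix} \in \mathbb{R}^{n \times m},
$$

with $\phi_i = [I - T]^{-1} R_i \mathbf{1}$.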

The proof of this lemma can be found in [1, 13, 21].
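
The statement of the lemma can be illustrated on a toy chain. The sketch below is not taken from the paper: the 4-state matrix P is hypothetical (two ergodic classes and one transient state), the perturbation C = (1/n)11^T - P is an assumed uniform-jump perturbation, and all helper names are ours. Since MCQ has zero row sums, the sketch takes the aggregated vector to be the normalized solution of nu(I + MCQ) = nu, that is, nu MCQ = 0; this reading of the aggregated equation is also our assumption. The stationary distribution of G(eps) for a small eps is then compared with the limit [nu_1 mu_1, nu_2 mu_2, 0].

```python
def vec_mat(x, A):
    """Row vector x times matrix A (plain lists of floats)."""
    return [sum(x[i] * A[i][j] for i in range(len(A))) for j in range(len(A[0]))]


def mat_mul(A, B):
    return [vec_mat(row, B) for row in A]


def stationary(A, iters=20000):
    """Normalized left fixed point x = xA, found by power iteration."""
    x = [1.0 / len(A)] * len(A)
    for _ in range(iters):
        x = vec_mat(x, A)
        total = sum(x)
        x = [v / total for v in x]
    return x


n, eps = 4, 1e-3
# Hypothetical unperturbed chain: states 0-1 form class 1, state 2 is class 2,
# state 3 is transient (so T is the 1x1 block P[3][3]).
P = [[0.5, 0.5, 0.0, 0.0],
     [0.2, 0.8, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.3, 0.1, 0.2, 0.4]]
# Assumed perturbation: uniform jumps, so that G(eps) stays stochastic.
C = [[1.0 / n - P[i][j] for j in range(n)] for i in range(n)]
G = [[P[i][j] + eps * C[i][j] for j in range(n)] for i in range(n)]

mu1 = stationary([row[:2] for row in P[:2]])   # stationary distribution of Q_1
mu2 = [1.0]                                    # class 2 is a single state
t = P[3][3]
# phi_i = [I - T]^{-1} R_i 1; the inverse is a scalar because T is 1x1 here.
phi = [sum(P[3][:2]) / (1.0 - t), P[3][2] / (1.0 - t)]

M = [mu1 + [0.0, 0.0], [0.0, 0.0] + mu2 + [0.0]]
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], phi]
D = mat_mul(mat_mul(M, C), Q)                  # aggregated generator MCQ
A = [[(1.0 if i == j else 0.0) + D[i][j] for j in range(2)] for i in range(2)]
nu = stationary(A)                             # solves nu (I + MCQ) = nu

limit = [nu[0] * mu1[0], nu[0] * mu1[1], nu[1] * mu2[0], 0.0]
pi_eps = stationary(G)
print("limit  :", [round(v, 4) for v in limit])
print("pi(eps):", [round(v, 4) for v in pi_eps])
```

The two printed vectors should agree up to terms of order eps, and the transient state should carry almost no mass.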



References 

[1] K. Avrachenkov, Analytic Perturbation Theory and its Applications, PhD thesis, University 
of South Australia, 1999. 






[2] K. Avrachenkov and N. Litvak, "The Effect of New Links on Google PageRank", Stochastic 
Models, v.22, pp.319-331, 2006. 

[3] K. Avrachenkov, N. Litvak, D. Nemirovsky and N. Osipova, "Monte Carlo methods in PageRank
computation: When one iteration is sufficient", in press at SIAM Journal on Numerical
Analysis, 2006.

[4] R. Baeza-Yates, P. Boldi and C. Castillo, "Generalizing PageRank: damping functions for
link-based ranking algorithms", in Proceedings of ACM SIGIR 2006, pp. 308-315. 

[5] P. Berkhin, "A Survey on PageRank Computing", Internet Mathematics, v.2, pp.73-120, 2005.

[6] M. Bianchini, M. Gori and F. Scarselli, "Inside PageRank", ACM Trans. Internet Technology,
v.5, no.1, pp.92-128, 2005.

[7] P. Boldi, M. Santini and S. Vigna, "PageRank as a function of the damping factor", in 
Proceedings of WWW 2005, pp.557-566. 

[8] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and 
J. Wiener, "Graph structure in the Web", Computer Networks, v. 33, pp. 309-320, 2000. 

[9] P. Chen, H. Xie, S. Maslov and S. Redner, "Finding Scientific Gems with Google", arXiv
preprint physics/0604130, 2006.

[10] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins, "Self- 
similarity in the Web", ACM Trans. Internet Technology v.2, no.3, pp.205-223, 2002. 

[11] S. Fortunato and A. Flammini, "Random Walks on Directed Networks: the Case of PageRank",
arXiv preprint physics/0604203.

[12] J. Kleinberg, "Authoritative sources in a hyperlinked environment", Journal of ACM, v.46,
pp.604-632, 1999. 

[13] V.S. Korolyuk and A.F. Turbin, "Mathematical Foundations of the State Lumping of Large 
Systems", Kluwer, 1993, Translation of Russian Edition from 1978.

[14] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins and E. Upfal, "The
Web as a graph", PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, pp.1-10, 2000.

[15] A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search
Engine Rankings, Princeton University Press, 2006. 

[16] R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and 
the TKC effect", Computer Networks, v.33, pp.387-401, 2000. 

[17] CD. Moler and K.A. Moler, Numerical Computing with MATLAB, SIAM, 2003. 

[18] L. Page, S. Brin, R. Motwani and T. Winograd, "The pagerank citation ranking: Bringing 
order to the web", Stanford Technical Report, 1998. 

[19] A. A. Pervozvanski and V.G. Gaitsgory, Theory of Suboptimal Decisions: Decomposition and 
Aggregation, Kluwer, 1988. 

[20] E. Seneta, Non-negative matrices and Markov chains, Springer, 1973. 

[21] G.G. Yin and Q. Zhang, Discrete-Time Markov Chains: Two-Time-Scale Methods and Applications,
Springer, 2005.






